1 Introduction

The goal of this market segmentation analysis is to develop user “personas” to inform future marketing efforts and product development. The analysis uses an unsupervised machine learning approach to cluster users into distinct personas. Unsupervised clustering offers a data-driven way to infer structure within the data, and the resulting clusters can then be interpreted to understand how Duolingo users are similar to or different from each other. Because many of the survey features were collected as categorical variables, this analysis uses the k-modes clustering algorithm. Some numerical variables were converted into categories, as k-modes can only handle categorical variables. K-modes works by iteratively comparing each data point to k centroids (the cluster modes). Each point is assigned to the cluster whose mode it is most similar to, and the modes are then recalculated. The distance between a point and a cluster mode is measured by dissimilarity (the total number of mismatched attributes).
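The dissimilarity measure described above can be illustrated with a minimal base R sketch. The modes and the new user below are made-up values chosen for illustration, not results from this analysis, and `simple_matching` is a hypothetical helper name.

```r
# Simple matching dissimilarity: number of attributes on which two records differ
simple_matching <- function(a, b) sum(a != b)

# Hypothetical cluster modes (illustrative values only, one row per cluster)
modes <- rbind(
  c(age = "35-54", income = "mid", review = "review"),
  c(age = "18-34", income = "low", review = "first time"),
  c(age = "55-74", income = "mid", review = "first time")
)

# A hypothetical new user to be assigned to a cluster
new_user <- c(age = "18-34", income = "low", review = "review")

# Count mismatches between the new user and each mode
distances <- apply(modes, 1, simple_matching, b = new_user)
distances

# The user joins the cluster with the fewest mismatches
assigned <- which.min(distances)
assigned
```

Ties between modes are broken here by taking the first minimum, which is one possible convention among several.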

2 Data source

The data consist of user survey and usage data from Duolingo. The survey asked users a series of questions about demographics (e.g., country, age, employment status) and motivation (e.g., primary reason for studying a language). The goal of this survey was to develop user segments (or personas) to inform future marketing efforts and product development. Usage data was collected from August 1, 2018 to November 5, 2018.

3 EDA Summary

Exploratory data analysis revealed that a significant portion (57%) of the daily goal values were NAs. Over 50% of users from four countries (MX, FR, JP, RU) have purchased a subscription. Interestingly, these four countries also have the lowest percentage of users in the 0-10,000 salary range and a high number of lessons completed. For other variables, such as commitment, employment status, and age, differences between these geographies are less clear. Additionally, we found that those who purchase a subscription also use the app more.

4 Results (3 Personas)

High Value Customer
- Most likely to purchase a subscription
- Very active with Duolingo app
- Very committed to learning
- High proportion of retirees
- Generally older (55 – 74)
- Generally earn more
- Mixed language proficiency

New Language Students
- Least likely to purchase a subscription
- Generally younger (18-34)
- Most earn less than 10k
- Learning a language for the first time
- Highest probability of being a student or unemployed

Working Adult Reviewer
- Reviewing a language they have studied before
- Generally middle age (35 – 54)
- Generally earning $26k – $75k
- Most likely to take a placement test
- Highest employment rate

Recommendations for product changes and marketing campaigns

High Value Customer
- Consider developing a loyalty and referral program targeted at this group. Highlight the referral scheme, as word of mouth is the best way to win new customers.
- Provide dedicated service representatives in case they have issues.

New Language Students
- This is a young group of new language learners. Consider campaigns that expose them to multiple new languages to help them discover one that interests them.
- Appeal to young people’s desire to experience new languages with marketing campaigns focused on travel.

Working Adult Reviewer
- Most likely to review an old language, so target notifications and marketing campaigns at relearning an old language.
- Most likely to be working a job, so consider sending notifications after working hours, when this group is most likely to be active on the app.

5 Key Visualizations

5.1 Data pre-processing.

The daily goal feature has more than 50% of its values missing (NA), so we chose to remove it.
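A check of this kind can be sketched in base R as follows. The data frame is a toy example, the column name `daily_goal` is an assumption about how the feature is named, and the 50% cutoff mirrors the rule stated above.

```r
# Toy usage data with missing values (hypothetical values and column names)
df <- data.frame(
  daily_goal     = c(NA, 20, NA, NA, 10, NA),
  n_active_days  = c(3, 10, NA, 7, 1, 4),
  longest_streak = c(1, 5, 2, 9, 1, 3)
)

# Fraction of missing values per column
na_frac <- colMeans(is.na(df))
na_frac

# Keep only the columns with at most 50% missing values
keep <- names(na_frac)[na_frac <= 0.5]
df_clean <- df[, keep, drop = FALSE]
names(df_clean)
```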

5.2 Data pre-processing.

Longest_streak has a group of individuals with abnormally high values. This could be a technical issue with how the data was collected, so we chose to remove this abnormal group. Other features, such as n_lessons_started and n_days_on_platform, look as we expect.

5.3 Data pre-processing.

What is the distribution of time spent completing the survey? We do not want survey results that are inaccurate. Based on a histogram of time spent on the survey, we chose to remove users who did not spend at least 100 seconds (a value of 2 on the log10 scale) filling out the survey.
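The filter can be sketched in base R as below. The data and the column name `time_spent_seconds` are hypothetical. The log10 column is only used for inspecting the distribution, while the filter is applied on raw seconds, which is equivalent because log10(100) = 2.

```r
# Toy survey timings (hypothetical values and column names)
survey <- data.frame(
  user_id            = 1:6,
  time_spent_seconds = c(12, 95, 100, 240, 1800, 60)
)

# Log scale is convenient for a histogram of a right-skewed variable
survey$log10_time <- log10(survey$time_spent_seconds)
hist(survey$log10_time, main = "Time spent on survey", xlab = "log10(seconds)")

# Keep users who spent at least 100 seconds, i.e. log10(time) >= 2
survey_kept <- survey[survey$time_spent_seconds >= 100, ]
survey_kept
```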

5.4 Exploratory data analysis.

As expected, those who purchase a subscription are more active.

The heatmap shows that the number of active days, lessons started, lessons completed, and highest crown count correlate with each other. The overall trend is that most of these app usage features are positively correlated with each other.

# Check correlations. To do so, we encode the logical variables purchased_subscription and took_placement_test as 0/1 integers
df_usage$purchased_subscription <- as.integer(as.logical(df_usage$purchased_subscription))
df_usage$took_placement_test <- as.integer(as.logical(df_usage$took_placement_test))

df_corr <- df_usage[,c("highest_course_progress", "took_placement_test", "purchased_subscription", "highest_crown_count",
                       "n_active_days","n_lessons_started","n_lessons_completed","longest_streak","n_days_on_platform")]
df_corr <- na.omit(df_corr)

### Correlation matrix, rounded to two decimals
cormat <- round(x = cor(df_corr), digits = 2)
### Get lower triangle of the correlation matrix (upper triangle set to NA)
get_lower_tri <- function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}
### Get upper triangle of the correlation matrix (lower triangle set to NA)
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)]<- NA
  return(cormat)
}

upper_tri <- get_upper_tri(cormat)
upper_tri
##                         highest_course_progress took_placement_test
## highest_course_progress                       1                0.18
## took_placement_test                          NA                1.00
## purchased_subscription                       NA                  NA
## highest_crown_count                          NA                  NA
## n_active_days                                NA                  NA
## n_lessons_started                            NA                  NA
## n_lessons_completed                          NA                  NA
## longest_streak                               NA                  NA
## n_days_on_platform                           NA                  NA
##                         purchased_subscription highest_crown_count
## highest_course_progress                   0.19                0.66
## took_placement_test                       0.07                0.19
## purchased_subscription                    1.00                0.29
## highest_crown_count                         NA                1.00
## n_active_days                               NA                  NA
## n_lessons_started                           NA                  NA
## n_lessons_completed                         NA                  NA
## longest_streak                              NA                  NA
## n_days_on_platform                          NA                  NA
##                         n_active_days n_lessons_started n_lessons_completed
## highest_course_progress          0.37              0.26                0.27
## took_placement_test              0.07              0.13                0.13
## purchased_subscription           0.36              0.33                0.33
## highest_crown_count              0.55              0.52                0.52
## n_active_days                    1.00              0.50                0.50
## n_lessons_started                  NA              1.00                0.98
## n_lessons_completed                NA                NA                1.00
## longest_streak                     NA                NA                  NA
## n_days_on_platform                 NA                NA                  NA
##                         longest_streak n_days_on_platform
## highest_course_progress           0.34               0.45
## took_placement_test               0.02              -0.07
## purchased_subscription            0.25               0.10
## highest_crown_count               0.51               0.35
## n_active_days                     0.47               0.17
## n_lessons_started                 0.27               0.01
## n_lessons_completed               0.27               0.01
## longest_streak                    1.00               0.28
## n_days_on_platform                  NA               1.00
### Melt the matrix to long format (melt() is from reshape2)
library(reshape2)
melted_cormat <- melt(upper_tri, na.rm = TRUE)

### Heatmap
library(ggplot2)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limits = c(-1, 1), space = "Lab",
                       name = "Pearson\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1,
                                   size = 12, hjust = 1)) +
  coord_fixed()

5.5 Exploratory data analysis

MX, FR, JP and RU have extremely high rates of subscription payments. Their users also share high levels of commitment to learning the language and have similar age profiles (a relatively high proportion of 55-74 year olds). None of these four countries is natively English-speaking. Germany (DE) is another non-English-speaking country with similar age demographics and high levels of commitment but, despite this, has a low proportion of subscribers. DE could be an untapped market. One strategy could be to lower the cost of subscriptions in Germany to ease the point of entry.

DE users include far fewer 151k+ earners than MX, FR, JP and RU. Consider lowering the subscription cost in this region.
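A country-level comparison like this one can be computed with a grouped mean. The sketch below uses a toy data frame with made-up rows and assumes `purchased_subscription` is already coded as 0/1, as in the correlation code earlier.

```r
# Toy user table (hypothetical rows, not the survey data)
users <- data.frame(
  country                = c("MX", "MX", "FR", "DE", "DE", "DE", "JP", "RU"),
  purchased_subscription = c(1, 1, 1, 0, 0, 1, 1, 0)
)

# Share of subscribers per country: the mean of a 0/1 indicator is a proportion
sub_rate <- tapply(users$purchased_subscription, users$country, mean)

# Countries ordered from highest to lowest subscription rate
sort(sub_rate, decreasing = TRUE)
```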

5.6 Kmodes clustering

Most of the survey data is categorical, so we use k-modes to cluster it. The goal of this clustering is to identify distinct groups within the data set. First we build our data frame and then use an elbow plot to determine the optimal number of clusters k. From the elbow plot we select k = 3.
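The elbow procedure can be sketched as follows. The report does not show which k-modes implementation was used, so this example substitutes a small base R version (`simple_kmodes`, a hypothetical helper) run on randomly generated toy data. A package implementation such as `klaR::kmodes()` could be used instead. Because the toy data has no real structure, the curve shows the mechanics only, not a meaningful elbow.

```r
set.seed(42)

# Toy categorical data standing in for the survey features (hypothetical values)
n <- 150
toy <- data.frame(
  age     = sample(c("18-34", "35-54", "55-74"), n, replace = TRUE),
  income  = sample(c("low", "mid", "high"), n, replace = TRUE),
  student = sample(c("yes", "no"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Most frequent category of a vector
col_mode <- function(v) names(which.max(table(v)))

# Minimal k-modes: assign rows to the nearest mode, then recompute the modes
simple_kmodes <- function(df, k, iter_max = 10) {
  x <- as.matrix(df)
  u <- unique(x)
  modes <- u[sample(nrow(u), k), , drop = FALSE]  # random distinct starting modes
  cluster <- integer(nrow(x))
  for (i in seq_len(iter_max)) {
    # n x k matrix of mismatch counts between every row and every mode
    d <- sapply(seq_len(k), function(j) {
      rowSums(x != modes[rep(j, nrow(x)), , drop = FALSE])
    })
    new_cluster <- max.col(-d, ties.method = "first")
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    for (j in unique(cluster)) {
      modes[j, ] <- apply(x[cluster == j, , drop = FALSE], 2, col_mode)
    }
  }
  # Total within-cluster dissimilarity: mismatches between rows and their mode
  cost <- sum(x != modes[cluster, , drop = FALSE])
  list(cluster = cluster, modes = modes, cost = cost)
}

# Total dissimilarity for k = 1..6, plotted as an elbow curve
ks <- 1:6
costs <- sapply(ks, function(k) simple_kmodes(toy, k)$cost)
plot(ks, costs, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster dissimilarity")
```

The elbow is the value of k after which adding more clusters yields only small reductions in total dissimilarity.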

The modes of our clusters

##       age     annual_income  employment_status                 student
## 1 35 - 54 $26,000 - $75,000 Employed full-time Not currently a student
## 2   18-34      $0 - $10,000 Employed full-time Not currently a student
## 3 55 - 74 $26,000 - $75,000 Employed full-time Not currently a student
##                       duolingo_subscriber
## 1 No, I have never paid for Duolingo Plus
## 2 No, I have never paid for Duolingo Plus
## 3  Yes, I currently pay for Duolingo Plus
##                           primary_language_commitment
## 1 I'm moderately committed to learning this language.
## 2 I'm moderately committed to learning this language.
## 3       I'm very committed to learning this language.
##                                          primary_language_review
## 1  I am using Duolingo to review a language I've studied before.
## 2 I am using Duolingo to learn this language for the first time.
## 3 I am using Duolingo to learn this language for the first time.
##   primary_language_proficiency took_placement_test n_lessons_completed_cat
## 1                 Intermediate                   1                       2
## 2                     Beginner                   0                       1
## 3                     Beginner                   1                       3
##   purchased_subscription
## 1                      0
## 2                      0
## 3                      1
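The `n_lessons_completed_cat` column in the modes above takes the values 1 to 3, which suggests it is a binned version of the numeric lesson count. The bin edges are not stated in this report, so the sketch below assumes tercile cuts purely for illustration, on made-up counts.

```r
# Toy lesson counts (hypothetical values)
n_lessons_completed <- c(0, 2, 5, 8, 13, 21, 34, 55, 89)

# Assumed binning rule: terciles of the observed distribution
breaks <- quantile(n_lessons_completed, probs = c(0, 1/3, 2/3, 1))

# Convert the numeric variable into three ordered categories labelled 1, 2, 3
n_lessons_completed_cat <- cut(n_lessons_completed, breaks = breaks,
                               labels = 1:3, include.lowest = TRUE)

# Number of users falling into each category
table(n_lessons_completed_cat)
```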

6 Main figure

The radar chart summarizes the key attributes of each cluster.

High Value Customer
- Most likely to purchase a subscription
- Very active with Duolingo app
- Very committed to learning
- High proportion of retirees
- Generally older (55 – 74)
- Generally earn more
- Mixed language proficiency

New Language Students
- Least likely to purchase a subscription
- Generally younger (18-34)
- Most earn less than 10k
- Learning a language for the first time
- Highest probability of being a student or unemployed

Working Adult Reviewer
- Reviewing a language they have studied before
- Generally middle age (35 – 54)
- Generally earning $26k – $75k
- Most likely to take a placement test
- Highest employment rate

6.1 Subscription purchases by cluster

The High Value Customer cluster has users who are willing to pay for a subscription.

Commitment to learning a language

High Value Customers are more committed to learning a language.

Age breakdown by cluster

High Value Customers tend to be older, while New Language Students are younger.

Income breakdown by cluster

6.2 Reviewing or learning new language by cluster

3D plots with plotly show our 3 clusters by course progress, purchased_subscription, and active days.